Course:

Introduction to Open Data Science 2018

Short description:

The aim of this course is to demonstrate the main principles of data handling, statistical methods, and data visualization. By the end of the course, students will be able to apply the principles of reproducible research. In addition, after this course, you will see the possibilities and opportunities of programming.

Data is everywhere, so learn how to use it to your advantage!

The course contains weekly coding exercises:

Deadlines:

  • Sun 4.11 Ex1
  • Wed 7.11 peer-review
  • Sun 11.11 Ex2
  • Wed 14.11 peer-review
  • Sun 18.11 Ex3
  • Wed 21.11 peer-review
  • Sun 25.11 Ex4
  • Wed 28.11 peer-review
  • Sun 2.12 Ex5
  • Wed 5.12 peer-review
  • Sun 9.12 Ex6
  • Wed 12.12 peer-review

GitHub repository:

Link to my GitHub repo


Data analysis exercise

Data description

The data aims to model the relationship between learning approaches and students’ achievement (exam points) in an introductory statistics course in Finland. There are 7 variables in total, six of which explain the students’ achievement (points): “gender”, “Age”, “Attitude”, “Deep”, “Stra” and “Surf”.

Data dimensions

The data has 166 observations (students) and 7 variables, i.e. a table with 166 rows and 7 columns.

Variables:

  • Age; Age (in years) derived from the date of birth

  • Attitude; Global attitude toward statistics

  • Points; Exam points

  • gender; Gender: M (Male), F (Female)

  • Deep; Deep approach ~ d_sm + d_ri + d_ue, where d_sm = Seeking Meaning, d_ri = Relating Ideas, d_ue = Use of Evidence

  • Surf; Surface approach ~ su_lp + su_um + su_sb, where su_lp = Lack of Purpose, su_um = Unrelated Memorising, su_sb = Syllabus-boundness

  • Stra; Strategic approach ~ st_os + st_tm, where st_os = Organized Studying, st_tm = Time Management
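The three approach variables above are composites built from several questionnaire items each. A minimal sketch of the idea, assuming (as in the course material) that a composite is the mean of its items and so stays on the original 1-5 scale; the item values here are invented for illustration:

```r
# Composite "Deep" as the mean of its items (invented example values).
items <- data.frame(
  d_sm = c(4, 3),  # Seeking Meaning
  d_ri = c(5, 2),  # Relating Ideas
  d_ue = c(3, 4)   # Use of Evidence
)
deep <- unname(rowMeans(items))  # stays on the original 1-5 scale
deep
## [1] 4 3
```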

Variable statistics


url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
summary(data_stud)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Data Visualization

Plots

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
boxplot(split(data_stud$points,data_stud$gender),main='Points by Gender')

boxplot(split(data_stud$attitude,data_stud$gender),main='Attitude by Gender')

boxplot(split(data_stud$deep,data_stud$gender),main='Deep by Gender')

boxplot(split(data_stud$stra,data_stud$gender),main='Stra by Gender')

boxplot(split(data_stud$surf,data_stud$gender),main='Surf by Gender')

pairs(~points+age+attitude+deep+stra+surf,data=data_stud, 
   main="Scatterplot Matrix")

library(corrplot)
## corrplot 0.84 loaded
data_stud$gender <- factor(data_stud$gender)
num_data <- data_stud[,c("points","age","attitude","deep","stra","surf")]
res <- cor(num_data)
corrplot(as.matrix(res), method="circle")



Comments on data

Gender

There seems to be slight variation in points between the genders:

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
tapply(data_stud$points, data_stud$gender, mean)
##        F        M 
## 22.32727 23.48214

The difference in mean points is (23.48214 - 22.32727) ≈ 1.15

Conclusion: males tend to get slightly higher points than females.

An interesting difference between the genders is that the attitude variable (which describes general attitude towards statistics) is clearly higher for males than for females. Females, on the other hand, score higher on stra (strategic approach: organizing studies and time management). Females might find a lack of purpose in statistics-related studies, and studying might therefore involve unrelated memorising (which would show up in Surf).

Age

When looking at the age variable, one should note that there is an extremely high number of occurrences in the 20-30-year category: this may give a misleading picture of how age truly affects a student's total points.

Correlation between points and age:

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
with(data_stud, cor(age, points))
## [1] -0.09319032

Correlation is negative.

Conclusion: the older the student is, the lower the points tend to be. However, the correlation is not particularly strong.

Attitude, Stra, Deep and Surf

Correlations between students attitude and study habits, and exam points.

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
with(data_stud, cor(attitude, points))
## [1] 0.4365245
with(data_stud, cor(stra, points))
## [1] 0.1461225
with(data_stud, cor(deep, points))
## [1] -0.01014849
with(data_stud, cor(surf, points))
## [1] -0.1443564

The variable that best indicates students' points is, by far, attitude, with a correlation of 0.44.

Stra and Surf give the second-best indications: students with a strategic approach tend to get better points, while Surf (Lack of Purpose, Unrelated Memorising, Syllabus-boundness) decreases students' points.

Correlations between the descriptive variables

From the correlation plot (with circles from red to blue), we can see that Surf and Deep have a strong negative correlation: students who lack motivation do not seek new ideas or meaning in statistics.

The strategic approach (Stra) correlates positively with all the other variables except Surf.




Fitting regression model

Model parameters

I chose three explanatory variables:

  • Attitude
  • Stra
  • Age
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
data_stud <- read.delim(url, sep=",")
l_model <- lm(points ~ attitude + stra + age, data=data_stud)
summary(l_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = data_stud)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

What does it do?

Multiple linear regression models the relationship between the explanatory variables (attitude, stra, age) and the response variable (points). The model fits a linear equation (bluntly, a straight line in each explanatory variable) to the observed data.
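To make the mechanics concrete, here is a minimal sketch on simulated data (not the course data): lm() chooses the coefficients that minimise the sum of squared residuals, so it should approximately recover the intercept and slope used to generate the data.

```r
# Sketch: ordinary least squares on simulated data.
set.seed(42)
x <- runif(100, 1, 5)                  # an "attitude-like" explanatory variable
y <- 2 + 3 * x + rnorm(100, sd = 0.5)  # linear signal plus noise
fit <- lm(y ~ x)
coef(fit)  # intercept close to 2, slope close to 3
```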

Results

As expected, attitude shows strong statistical significance in the model output: its p-value is clearly under 0.001 (4.7*10^-9, to be exact). The strategic approach (stra) is borderline: its p-value, 0.0621, is just above the conventional 0.05 threshold, so it is suggestive rather than clearly significant. Age is the weakest explanatory variable, with a p-value of about 0.098.

Although some of the explanatory variables are statistically significant in the linear model, the R-squared value of the whole model stays rather low: about 0.22 (adjusted 0.20).

Model: Points_est = 10.89543 + 3.48077 * attitude + 1.00371 * stra - 0.08822 * age
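As a quick sanity check, the fitted equation can be evaluated for a hypothetical student (attitude 4, stra 3, age 22; values chosen purely for illustration):

```r
# Predicted exam points for a hypothetical student, from the fitted coefficients.
points_est <- 10.89543 + 3.48077 * 4 + 1.00371 * 3 - 0.08822 * 22
round(points_est, 2)
## [1] 25.89
```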



Re-model, without age

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
l_model <- lm(points ~ attitude + stra, data=data_stud)
summary(l_model)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = data_stud)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Results without age

The model got worse without the age variable: the adjusted R-squared dropped from 0.20 to 0.19. Conclusion: continue with attitude, stra and age.



Diagnostic plots

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
l_model <- lm(points ~ attitude + stra + age, data=data_stud)
plot(l_model)

Diagnostics interpretation

The aim of the diagnostic plots is to show how well our multiple linear regression model actually fits the data.

Residuals vs Fitted

This scatterplot shows the model residuals against the fitted values. If the residuals are small and scattered randomly around zero, the model predicts well; a wide or patterned scatter indicates that the model is not giving good predictions. Remember, our model's R^2 was only about 0.22, so sizable residuals are to be expected.

QQ-plot

The normal Q-Q plot is a normal probability plot: if the model errors follow a normal distribution, the points fall on a straight line. In our case there is slight divergence from normality at both ends: extremely low and high point totals are harder for the model to predict.
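Normality of the residuals can also be checked numerically, for example with a Shapiro-Wilk test; a sketch on simulated residuals (a stand-in for resid(l_model), not the course model), where a large p-value means no evidence against normality:

```r
# Sketch: Shapiro-Wilk normality test on simulated residuals.
set.seed(1)
res <- rnorm(100)  # stand-in for resid(l_model)
shapiro.test(res)$p.value
```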

Residuals vs Leverage

The last plot describes the influence of individual observations on the model. We can see from the plot that observations 56, 4 and 2 have the most influence. We could go back to the original data and check whether something is wrong with them, or try to explain why these particular students have such different combinations of points, attitude, stra and age; they clearly do not follow the trend.


Logistic regression analysis

Data description

The data has 382 observations (students) and 35 variables, i.e. a table with 382 rows and 35 columns.

Variable-names and summary

The data describes student achievement in two Portuguese schools. The attributes give information on students' social background and study habits, along with several variables describing students' overall living habits and lifestyle. The data covers student performance in two subjects: Mathematics (mat) and Portuguese language (por).

Article: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
summary(alc_data)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.442  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.034   Mean   :0.2906                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel        freetime         goout      
##  no : 18   no :261   Min.   :1.00   Min.   :1.000   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.00   1st Qu.:3.000   1st Qu.:2.000  
##                      Median :4.00   Median :3.000   Median :3.000  
##                      Mean   :3.94   Mean   :3.223   Mean   :3.113  
##                      3rd Qu.:5.00   3rd Qu.:4.000   3rd Qu.:4.000  
##                      Max.   :5.00   Max.   :5.000   Max.   :5.000  
##       Dalc            Walc          health         absences     
##  Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:3.000   1st Qu.: 0.000  
##  Median :1.000   Median :2.00   Median :4.000   Median : 3.000  
##  Mean   :1.474   Mean   :2.28   Mean   :3.579   Mean   : 5.319  
##  3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000   3rd Qu.: 8.000  
##  Max.   :5.000   Max.   :5.00   Max.   :5.000   Max.   :75.000  
##        G1              G2              G3           alc_use     
##  Min.   : 3.00   Min.   : 0.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 8.00   1st Qu.: 8.25   1st Qu.: 8.00   1st Qu.:1.000  
##  Median :10.50   Median :11.00   Median :11.00   Median :1.500  
##  Mean   :10.86   Mean   :10.71   Mean   :10.39   Mean   :1.877  
##  3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :19.00   Max.   :19.00   Max.   :20.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:270      
##  TRUE :112      
##                 
##                 
## 


Choice of four interesting data variables


1. Sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

It is well known that males tend to drink more alcohol than females (especially in “macho” cultures). Hypothesis: alcohol consumption is generally higher in the “M” class, so being male increases the odds of high alcohol use.


2. Studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

Students who are willing to study and work hard generally do not have time for drinking, and drinking alcohol lowers a student's motivation to study. Hypothesis: alcohol consumption decreases as study time rises; a high study time decreases the odds of high alcohol consumption.


3. Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

I suggest that the father's job and social status correlate strongly with the children's achievement. The father's own alcohol consumption might correlate even more clearly with the student's drinking habits: a child learns from example! Hypothesis: the higher the father's social status, the less the children drink or pick up other bad habits; a higher-status father's job decreases the student's odds of high alcohol consumption.


4. Higher - wants to take higher education (binary: yes or no)

When students are ambitious about their studies, they are less likely to spend their time drinking alcohol. Hypothesis: answering “yes” to this variable decreases the odds of high alcohol use.


Data visualization

Attributes overview

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gather(alc_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+ geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped


Choice of variables: Sex

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
library(tidyr); library(dplyr); library(ggplot2)
ggplot() + 
  geom_density(data=alc_data, aes(x=alc_use, group=sex, fill=sex),alpha=0.5, adjust=2) + 
  xlab("Alcohol use (1-5)") +
  ylab("Density")

Crosstab

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$sex <- factor(alc_data$sex)
library(tidyr); library(dplyr); library(ggplot2); library(descr)
CrossTable(alc_data$high_use, alc_data$sex)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## ==========================================
##                      alc_data$sex
## alc_data$high_use        F       M   Total
## ------------------------------------------
## FALSE                  157     113     270
##                      2.078   2.236        
##                      0.581   0.419   0.707
##                      0.793   0.614        
##                      0.411   0.296        
## ------------------------------------------
## TRUE                    41      71     112
##                      5.009   5.390        
##                      0.366   0.634   0.293
##                      0.207   0.386        
##                      0.107   0.186        
## ------------------------------------------
## Total                  198     184     382
##                      0.518   0.482        
## ==========================================


There seems to be a fairly clear division in alcohol consumption between males and females. Female alcohol consumption is concentrated in classes 1 and 2, while male drinking is spread more evenly across all the consumption classes. It seems to be genuinely rare to find Portuguese females who report themselves in alcohol use classes 4-5.


Choice of variables: Studytime

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$studytime <- factor(alc_data$studytime)
library(tidyr); library(dplyr); library(ggplot2)
ggplot() + 
  geom_density(data=alc_data, aes(x=alc_use, group=studytime, color=studytime),alpha=1, adjust=2, size=2) + 
  xlab("Alcohol use (1-5)") +
  ylab("Density")

Crosstab

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$studytime <- factor(alc_data$studytime)
library(tidyr); library(dplyr); library(ggplot2); library(descr)
CrossTable(alc_data$high_use, alc_data$studytime)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## ==========================================================
##                      alc_data$studytime
## alc_data$high_use        1       2       3       4   Total
## ----------------------------------------------------------
## FALSE                   60     133      54      23     270
##                      2.251   0.012   2.364   0.804        
##                      0.222   0.493   0.200   0.085   0.707
##                      0.583   0.700   0.871   0.852        
##                      0.157   0.348   0.141   0.060        
## ----------------------------------------------------------
## TRUE                    43      57       8       4     112
##                      5.426   0.030   5.699   1.937        
##                      0.384   0.509   0.071   0.036   0.293
##                      0.417   0.300   0.129   0.148        
##                      0.113   0.149   0.021   0.010        
## ----------------------------------------------------------
## Total                  103     190      62      27     382
##                      0.270   0.497   0.162   0.071        
## ==========================================================


Study time correlates with alcohol consumption: as study time rises, the probability of high alcohol use decreases. Students who spend exceptionally large amounts of time studying (classes 3 and 4) seem to use almost no alcohol, falling mostly in consumption classes 1-2. Interestingly, students in study-time class 3 do not participate in heavy drinking (class 5) at all; these people could be described as “middle-of-the-road”.


Choice of variables: Father’s job

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$Fjob <- factor(alc_data$Fjob)
library(tidyr); library(dplyr); library(ggplot2)
ggplot() + 
  geom_density(data=alc_data, aes(x=alc_use, group=Fjob, color=Fjob),alpha=1, adjust=2, size=2) + 
  xlab("Alcohol use (1-5)") +
  ylab("Density")

Crosstab

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$Fjob <- factor(alc_data$Fjob)
library(tidyr); library(dplyr); library(ggplot2); library(descr)
CrossTable(alc_data$high_use, alc_data$Fjob)
## Warning in chisq.test(tab, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## ==========================================================================
##                      alc_data$Fjob
## alc_data$high_use    at_home   health   other   services   teacher   Total
## --------------------------------------------------------------------------
## FALSE                     14       13     149         69        25     270
##                        0.640    0.081   0.000      0.581     0.435        
##                        0.052    0.048   0.552      0.256     0.093   0.707
##                        0.875    0.765   0.706      0.645     0.806        
##                        0.037    0.034   0.390      0.181     0.065        
## --------------------------------------------------------------------------
## TRUE                       2        4      62         38         6     112
##                        1.544    0.194   0.000      1.400     1.050        
##                        0.018    0.036   0.554      0.339     0.054   0.293
##                        0.125    0.235   0.294      0.355     0.194        
##                        0.005    0.010   0.162      0.099     0.016        
## --------------------------------------------------------------------------
## Total                     16       17     211        107        31     382
##                        0.042    0.045   0.552      0.280     0.081        
## ==========================================================================


Contrary to my earlier hypothesis, the father's job does not seem to have a major influence on the student's alcohol consumption; all of the job classes are divided fairly similarly. An interesting feature of the father's job attribute is the “health” class, which has a high peak at alcohol use 1 but also a minor peak in classes 3-4. This could indicate that students whose fathers work in health-care-related jobs tend to use much less alcohol in general, with a few rebellious students who stand against their fathers' health doctrines.


Choice of variables: Higher education ambitions

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$higher <- factor(alc_data$higher)
library(tidyr); library(dplyr); library(ggplot2)
ggplot() + 
  geom_density(data=alc_data, aes(x=alc_use, group=higher, color=higher),alpha=1, adjust=2, size=2) + 
  xlab("Alcohol use (1-5)") +
  ylab("Density")

Crosstab

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
alc_data$higher <- factor(alc_data$higher)
library(tidyr); library(dplyr); library(ggplot2); library(descr)
CrossTable(alc_data$high_use, alc_data$higher)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## ==========================================
##                      alc_data$higher
## alc_data$high_use       no     yes   Total
## ------------------------------------------
## FALSE                    9     261     270
##                      1.089   0.054        
##                      0.033   0.967   0.707
##                      0.500   0.717        
##                      0.024   0.683        
## ------------------------------------------
## TRUE                     9     103     112
##                      2.626   0.130        
##                      0.080   0.920   0.293
##                      0.500   0.283        
##                      0.024   0.270        
## ------------------------------------------
## Total                   18     364     382
##                      0.047   0.953        
## ==========================================


Students' ambitions for higher education do indeed seem to have a slight influence on their drinking habits: higher education seems to attract students who use less alcohol than students with no motivation for higher education. The alcohol-use distribution within the “no” group is quite incoherent and divided more evenly than in the “yes” group. There also appear to be students who do not use alcohol but do not plan on higher education either.



Logistic model

Logistic regression is traditionally used for modelling the relationship between predictors (continuous, categorical or a mix of both) and a categorical response variable y that is binary (0/1). The model describes the probability (or odds) of y taking the value 1 given a set of predictors. In our case, we are modelling the probability of a student having high alcohol consumption (high_use = TRUE) with the predictors sex, study time, father's job and ambition for higher education.
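A minimal sketch of these mechanics on simulated data (not the course data): glm() with a binomial family estimates coefficients on the log-odds scale, and plogis() maps a log-odds value back to a probability.

```r
# Sketch: logistic regression on simulated binary data.
set.seed(7)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))  # true log-odds: -1 + 2x
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                           # estimates on the log-odds scale, near (-1, 2)
plogis(coef(fit)[["(Intercept)"]])  # predicted P(y = 1) at x = 0
```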

Training & Test datasets


Before fitting the logistic regression model, we need to split the data into training and test sets. I decided to use the first 300 students as the training set for the model, leaving the remaining 82 students for testing it.

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
train <- alc_data[1:300,]
test <- alc_data[301:382,]

dim(train)
## [1] 300  35
dim(test)
## [1] 82 35
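Taking the first 300 rows assumes the row order carries no structure; a safer alternative (a sketch with the same set sizes) is to sample the training rows at random:

```r
# Sketch: random train/test split (same sizes as above).
set.seed(123)
n        <- 382                      # nrow(alc_data)
train_id <- sample(seq_len(n), 300)  # 300 random training rows
test_id  <- setdiff(seq_len(n), train_id)
# train <- alc_data[train_id, ]; test <- alc_data[test_id, ]
c(length(train_id), length(test_id))
## [1] 300  82
```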


Building model


url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
train <- alc_data[1:300,]
train$sex <- factor(train$sex)
train$studytime <- factor(train$studytime)
train$Fjob <- factor(train$Fjob)
train$higher <- factor(train$higher)
train$high_use <- factor(train$high_use)
model <- glm(high_use ~ sex + studytime + Fjob + higher,family=binomial(link='logit'),data=train)
summary(model)
## 
## Call:
## glm(formula = high_use ~ sex + studytime + Fjob + higher, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2661  -0.9208  -0.6262   1.0912   2.2832  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -1.8926     1.2056  -1.570  0.11645   
## sexM           0.5782     0.2841   2.035  0.04183 * 
## studytime2    -0.2667     0.3023  -0.882  0.37766   
## studytime3    -1.7878     0.6629  -2.697  0.00699 **
## studytime4    -1.4421     0.6781  -2.127  0.03344 * 
## Fjobhealth     1.4410     1.2415   1.161  0.24576   
## Fjobother      1.2687     1.0925   1.161  0.24551   
## Fjobservices   1.8110     1.1074   1.635  0.10199   
## Fjobteacher    0.3926     1.2103   0.324  0.74566   
## higheryes     -0.2904     0.5513  -0.527  0.59834   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 361.29  on 299  degrees of freedom
## Residual deviance: 329.60  on 290  degrees of freedom
## AIC: 349.6
## 
## Number of Fisher Scoring iterations: 5

Summary of the model

From the logistic model summary table we can see that the predictors sex and studytime are statistically significant in the model, although studytime is significant for only two of its levels (3 and 4).

Sex: according to our model, being male increases the log-odds of high alcohol use by 0.58 compared to females (an odds ratio of exp(0.578) ≈ 1.78). The p-value, 0.04183, is slightly under the 0.05 threshold.

Studytime: increasing study time does indeed lower the odds of bad drinking habits; however, the effect truly kicks in when a student moves from level 2 to level 3. Simply put, students who study 5 to 10 hours or more per week have considerably smaller odds of high alcohol use. The p-value for studytime3 is 0.00699 (the lowest in the model) and for studytime4 it is 0.03344.

Father’s job: no statistical significance in any of the levels. The teacher level does have the smallest coefficient of the Fjob levels, which would suggest that students whose fathers work as teachers are the least likely to become heavy drinkers, but the uncertainty is large.

Pursuit of higher education: no statistical significance, but having ambition for higher education lowers the log-odds of high alcohol use by 0.29 (an odds ratio of exp(-0.29) ≈ 0.75).
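The coefficients discussed above are on the log-odds scale; exponentiating them gives odds ratios (using the estimates from the summary table above):

```r
# Log-odds coefficients from the summary -> odds ratios.
exp(0.5782)   # sexM: odds of high use about 1.78 times higher for males
exp(-1.7878)  # studytime3: odds multiplied by about 0.17
exp(-0.2904)  # higheryes: odds multiplied by about 0.75
```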


ANOVA table and McFadden R^2 index

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
train <- alc_data[1:300,]
train$sex <- factor(train$sex)
train$studytime <- factor(train$studytime)
train$Fjob <- factor(train$Fjob)
train$higher <- factor(train$higher)
train$high_use <- factor(train$high_use)
model <- glm(high_use ~ sex + studytime + Fjob + higher,family=binomial(link='logit'),data=train)

anova(model)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: high_use
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev
## NULL                        299     361.29
## sex        1   8.4311       298     352.86
## studytime  3  13.9372       295     338.92
## Fjob       4   9.0447       291     329.88
## higher     1   0.2747       290     329.60
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model)
##           llh       llhNull            G2      McFadden          r2ML 
## -164.80162560 -180.64550478   31.68775835    0.08770702    0.10023878 
##          r2CU 
##    0.14317798


The anova table describes how well our logistic model predicts high use of alcohol compared to the null model. The null model uses only the intercept.

The table shows that the largest drop in deviance comes from studytime. By contrast, the students' aspiration to higher education makes only a small difference to the model (0.27).

Furthermore, model fit can be assessed with McFadden's pseudo R^2 value, which is only about 0.088 for our model.

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept).
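
As a sanity check, the numbers above tie together: McFadden's pseudo R^2 and the G^2 statistic can be recomputed by hand from the log-likelihoods reported by pR2() (values copied from the output above):

```r
llh     <- -164.80163   # log-likelihood of the fitted model (from pR2)
llhNull <- -180.64550   # log-likelihood of the intercept-only model
mcfadden <- 1 - llh / llhNull    # ~0.0877, matching the pR2 output
g2       <- 2 * (llh - llhNull)  # ~31.69, the total deviance drop in the anova table
round(c(McFadden = mcfadden, G2 = g2), 4)
```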


Coefficients as odds ratios and confidence intervals

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc_data <- read.delim(url, sep=",")
train <- alc_data[1:300,]
train$sex <- factor(train$sex)
train$studytime <- factor(train$studytime)
train$Fjob <- factor(train$Fjob)
train$higher <- factor(train$higher)
train$high_use <- factor(train$high_use)
model <- glm(high_use ~ sex + studytime + Fjob + higher,family=binomial(link='logit'),data=train)
exp(cbind("Odds ratio" = coef(model), confint.default(model, level = 0.95)))
##              Odds ratio      2.5 %     97.5 %
## (Intercept)   0.1506773 0.01418480  1.6005605
## sexM          1.7829081 1.02160476  3.1115372
## studytime2    0.7658945 0.42348388  1.3851635
## studytime3    0.1673226 0.04563802  0.6134542
## studytime4    0.2364228 0.06258757  0.8930806
## Fjobhealth    4.2250405 0.37072282 48.1517881
## Fjobother     3.5563949 0.41790531 30.2650962
## Fjobservices  6.1164281 0.69798147 53.5984042
## Fjobteacher   1.4807783 0.13814056 15.8729946
## higheryes     0.7479459 0.25386006  2.2036669


The table above shows the odds ratio of each model predictor. Being male multiplies the odds of high alcohol consumption by about 1.78. All of the studytime levels (2-4) decrease the odds. The confidence intervals are based on asymptotic normality.
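
The interval endpoints in the table come from Wald intervals computed on the coefficient scale and then exponentiated. A sketch for sexM, using approximate values for the estimate and its standard error (read back from the interval above, not exact):

```r
beta <- 0.578; se <- 0.284             # approximate sexM estimate and std. error
ci <- exp(beta + c(-1.96, 1.96) * se)  # Wald 95% CI on the odds-ratio scale
round(ci, 2)                           # close to the (1.02, 3.11) reported above
```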


Data clustering and classification

This exercise focuses on how to classify and cluster data into categories according to its characteristics. Clustering methods try to find similarities in the data: which attributes group observations together, and which set them apart?


Data overview


library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(Boston)
dim(Boston)
## [1] 506  14
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7


The “Boston” dataset describes social and environmental characteristics of the city of Boston.

Boston data has 506 rows (observations) and 14 columns (attributes).

Variable explanations

  • CRIM - per capita crime rate by town

  • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

  • INDUS - proportion of non-retail business acres per town.

  • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

  • NOX - nitric oxides concentration (parts per 10 million)

  • RM - average number of rooms per dwelling

  • AGE - proportion of owner-occupied units built prior to 1940

  • DIS - weighted distances to five Boston employment centres

  • RAD - index of accessibility to radial highways

  • TAX - full-value property-tax rate per $10,000

  • PTRATIO - pupil-teacher ratio by town

  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

  • LSTAT - % lower status of the population

  • MEDV - Median value of owner-occupied homes in $1000’s

***Variable explanations: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html***

Variable visualization

library(MASS); library("GGally")
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
data(Boston)
Boston$chas <- factor(Boston$chas)
lowerFn <- function(data, mapping, ...) {
  p <- ggplot(data = data, mapping = mapping) +
    geom_point(color = 'blue', alpha=0.3, size=1) +
    geom_smooth(color = 'black', method='lm', size=1,...)
  p
}

g <- ggpairs( 
  data = Boston,
  lower = list(
    continuous =  wrap(lowerFn) #wrap("smooth", alpha = 0.3, color = "blue", lwd=1) 
  ),
  upper = list(continuous = wrap("cor", size = 4))
)
g <- g + theme(
  axis.text = element_text(size = 4),
  axis.title = element_text(size = 4),
  legend.background = element_rect(fill = "white"),
  panel.grid.major = element_line(colour = NA),
  panel.grid.minor = element_blank(),
  panel.background = element_rect(fill = "grey95")
)
print(g, bottomHeightProportion = 0.5, leftWidthProportion = .5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Summary

library(MASS);
data(Boston)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00


Data interpretation

The data consists mainly of continuous variables; the only categorical variable is CHAS, which takes binary values (0/1): CHAS equals 1 if the tract bounds the Charles River, 0 otherwise. The other variables are ratios or proportions describing the city districts: social, economic and environmental factors affecting the price of houses and apartments in Boston. From the graphical overview we can recognize several strong correlations between the variables; in this interpretation I will discuss only the major trends in the data.

The strongest correlation in the data is between MEDV (median house price) and LSTAT (% lower-status population), at -0.74: as the lower-status share increases, the median housing price drops rapidly.

RM (average number of rooms per dwelling) has a strong positive correlation (0.70) with MEDV. This is a fairly obvious relationship: more rooms (and space) means more costly housing. PTRATIO (pupil-teacher ratio by town) has a strong negative correlation (-0.51) with MEDV. The attribute reflects the overall social status of a town: poorer, less popular parts of the city often have less funding to offer schools, so class sizes tend to grow, while socially "better" areas have private schools with lower pupil-teacher ratios.

Other correlations between the study variables worth mentioning: AGE (proportion of owner-occupied units built prior to 1940) correlates with both LSTAT and DIS (weighted distances to five Boston employment centres). The lower-status population tends to live in older houses (built before the 1940s), simply because they cannot afford newer ones; in this data the older housing stock lies close to the employment centres in the old city core.


Standardizing data

The scale function standardizes each column of the data by subtracting the column mean and dividing by the column standard deviation.

  • Scaled_value = (value - col_mean) / col_std
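
The formula can be checked directly against scale() with a toy vector (a minimal sketch, not part of the analysis):

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)   # subtract column mean, divide by column sd
stopifnot(all.equal(as.numeric(scale(x)), manual))
manual                             # standardized values: mean 0, sd 1
```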


library(MASS);
data(Boston)
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)


From the summary we can recognize the result of standardization: the minimum value of each variable is negative, since every value below the column mean is now below zero. The scaled extremes also reflect the shape of each distribution: a long upper tail (as in crim) produces a large positive maximum. The mean of every variable is exactly 0.


Creating categorical variable from CRIM


library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
labels <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
levels(crime)
## [1] "low"      "med_low"  "med_high" "high"
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)


In the code above, the CRIM variable has been transformed into a categorical variable (4 levels) using the quantiles of standardized CRIM. We basically divided the data into 4 groups: the first group holds values below the first quartile (25 %), the second group values between the first quartile and the median, and so on. I renamed the levels to something more descriptive. Since we used the quantiles as breaks, all classes have almost exactly the same number of observations (low and high have 127, because 506 cannot be divided into 4 equal groups).
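
The same quantile-binning mechanics can be seen on a toy vector (illustrative only): breaks at the quantiles give four groups of equal size.

```r
set.seed(1)
x <- rnorm(100)
bins <- quantile(x)   # min, 25%, 50%, 75%, max
g <- cut(x, breaks = bins, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(g)              # 25 observations in each class
```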



Linear discriminant analysis (training and test-set inc.)

In order to create proper predictive models, we need to split our data into training and test datasets. In our case, we take a random sample to split the data into a training set (80 %) and a test set (20 %), and remove the crime variable from the test set. Next we fit a linear discriminant analysis with crime as the target variable and all the other variables in the dataset as predictors. LDA is a classification method that aims to find the linear combinations of variables that best separate the target variable's classes (the crime levels).


# boston_scaled is available
library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
labels <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)


# number of rows in the Boston dataset 
n <- dim(boston_scaled)[1]

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)


# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##      low  med_low med_high     high 
## 0.269802 0.250000 0.230198 0.250000 
## 
## Group means:
##                   zn      indus          chas        nox         rm
## low       1.01149275 -0.9041165 -0.1278483251 -0.8837112  0.4624630
## med_low  -0.05019056 -0.3381096  0.0395204559 -0.5716425 -0.0787730
## med_high -0.36644653  0.2175185  0.2356838659  0.4951792  0.1070958
## high     -0.48724019  1.0149946  0.0005392655  1.0391462 -0.3805584
##                 age        dis        rad        tax     ptratio
## low      -0.8771076  0.8495517 -0.6857985 -0.7646398 -0.38585272
## med_low  -0.3192240  0.3797467 -0.5384037 -0.4578098 -0.03799923
## med_high  0.5094399 -0.4456141 -0.4557992 -0.3414618 -0.50047014
## high      0.8035902 -0.8567003  1.6596029  1.5294129  0.80577843
##                black       lstat        medv
## low       0.38110814 -0.78611638  0.56442189
## med_low   0.31206388 -0.14857511  0.02280804
## med_high  0.06915263  0.05167016  0.21541261
## high     -0.67010018  0.81546327 -0.68501269
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.12744564  0.57789314 -1.07179809
## indus    0.07777846 -0.22933493  0.05521080
## chas    -0.07521587 -0.02866796  0.15741680
## nox      0.17602658 -0.80450764 -1.14805110
## rm      -0.16487470 -0.09883885 -0.08391882
## age      0.17504622 -0.40745764 -0.05776690
## dis     -0.15787695 -0.30063291  0.39539937
## rad      4.09952460  0.92556772 -0.55152113
## tax      0.05204753 -0.06035758  1.08421286
## ptratio  0.15970219  0.18467971 -0.21682399
## black   -0.15259442  0.07014070  0.10758938
## lstat    0.19126940 -0.24350027  0.38815400
## medv     0.23136379 -0.35076957 -0.10166842
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9598 0.0309 0.0093
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 2)


LDA predictions


# boston_scaled is available
library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
labels <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- dim(boston_scaled)[1]
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.fit <- lda(crime ~ ., data = train)
classes <- as.numeric(train$crime)


# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       20       5        2    0
##   med_low    5      17        5    0
##   med_high   0       7       16    1
##   high       0       0        0   24


The table above shows that our model predicts crime rates fairly well: it captures the lowest and highest classes very well. In contrast, the model struggles with the middle classes. The earlier LDA plot shows med_low and med_high points lying near the low- and high-class concentrations, so it seems the data might be better categorized into two classes (low/high) instead of four.
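
One summary number for the cross-tabulation above is the overall accuracy: the share of correctly classified test observations, i.e. the diagonal divided by the total (values copied from the table; the exact numbers vary with the random split):

```r
tab <- matrix(c(20,  5,  2,  0,
                 5, 17,  5,  0,
                 0,  7, 16,  1,
                 0,  0,  0, 24), nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)   # ~0.75 of the test observations classified correctly
```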


Clustering the data (euclidean dist.)

Calculating a distance measure between the observations is one method to organize data: in the code below I standardize the original data and compute the Euclidean distance matrix, and then the Manhattan distance matrix. The matrices contain pairwise distances between observations (city districts), which would be too large to print in full, so I show short summary tables with descriptive statistics instead. Manhattan distances are generally longer, which is expected, since the Manhattan metric does not take the "straight path". (More at https://www.quora.com/What-is-Manhattan-Distance)


library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method ="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
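
The difference between the two metrics is easy to see with two points: for an offset of (3, 4) the Euclidean distance is the straight-line length, while the Manhattan distance is the sum of the axis-wise steps. A minimal illustration:

```r
p <- rbind(c(0, 0), c(3, 4))
dist(p)                         # euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")   # manhattan: |3| + |4| = 7
```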

Clustering the data (Kmeans, clusters = 4)

K-means is one of the most common clustering methods. It is used in many fields, such as image analysis, where pixel values can be clustered fairly easily with k-means. The success of the method depends on many aspects, such as the homogeneity of the groups. Below I run k-means clustering on the standardized Boston data with 4 clusters.

library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# k-means clustering
km <- kmeans(boston_scaled, centers = 4)  # cluster the standardized data

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster,main="Clusters = 4")

pairs(Boston[6:10], col = km$cluster,main="Clusters = 4")


Clustering the data (Kmeans, clusters = 3, 2)

Observing the pairs plot, it seems rather clear that clustering the data into four groups does not give very good results: for most variables, two groups would seem to do the trick. Let's try with 3 clusters and then with 2.

library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)  # cluster the standardized data

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster, main="Clusters = 3")

pairs(Boston[6:10], col = km$cluster, main="Clusters = 3")

km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster, main="Clusters = 2")

pairs(Boston[6:10], col = km$cluster, main="Clusters = 2")


Finding the optimal number of clusters

For the last part, we will use the total within-cluster sum of squares (WCSS) to optimize the number of clusters. The optimal number is found where the WCSS drops sharply in the clusters-vs-WCSS plot. Without a fixed seed the method produces a different solution every time, since k-means places its initial centers randomly; here we set a seed for reproducibility.


library(MASS); library(dplyr, warn.conflicts = FALSE)
data(Boston)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)


Conclusion

Optimal number of clusters = 2

The cluster-WCSS plot shows that the optimal number of clusters is 2: the line drops radically at that point, and adding more clusters does not seem to add value to the clustering.



Dimensionality reduction techniques

Statistical problems are often multidimensional: in order to understand different phenomena, it is often necessary to compress many variables into fewer. This is what dimensionality reduction techniques aim for: we transform the original data variables into a few components that together capture as much of the variance in the original data as possible. The components make data interpretation, plotting and so on easier and more informative.


Data overview


human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep  =",", header = T)

dim(human)
## [1] 155   8
head(human)
##               Edu2.FM   Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth
## Norway      1.0072389 0.8908297    17.5     81.6 64992       4       7.8
## Australia   0.9968288 0.8189415    20.2     82.4 42261       6      12.1
## Switzerland 0.9834369 0.8251001    15.8     83.0 56431       6       1.9
## Denmark     0.9886128 0.8840361    18.7     80.2 44025       5       5.1
## Netherlands 0.9690608 0.8286119    17.9     81.6 45435       6       6.2
## Germany     0.9927835 0.8072289    16.5     80.9 43919       7       3.8
##             Parli.F
## Norway         39.6
## Australia      30.5
## Switzerland    28.5
## Denmark        38.0
## Netherlands    36.9
## Germany        36.9
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...


The human data consists of 8 attributes and 155 observations (countries). The variables describe the countries' Human Development Index: "The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and having a decent standard of living. The HDI is the geometric mean of normalized indices for each of the three dimensions." (http://hdr.undp.org)
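
As an illustration of the HDI formula, the geometric mean of three dimension indices (the index values below are hypothetical, chosen only for the example):

```r
# Hypothetical dimension index values, for illustration only.
health <- 0.95; education <- 0.90; income <- 0.85
hdi <- (health * education * income)^(1/3)   # geometric mean of the three indices
round(hdi, 3)                                # ~0.899
```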


Variables

  • Edu2.FM (Education index (females))

  • Labo.FM (Labour force participation rate (females))

  • Edu.Exp (Expected years of schooling)

  • Life.Exp (Life expectancy index)

  • GNI (Gross national income ($))

  • Mat.Mor (Maternal mortality ratio)

  • Ado.Birth (Adolescent birth rate)

  • Parli.F (Parliamentary representation (females))


Data visualization


library("GGally")
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep  =",", header = T)
human = transform(human, GNI = as.numeric(GNI))
human = transform(human, Mat.Mor = as.numeric(Mat.Mor))


lowerFn <- function(data, mapping, ...) {
  p <- ggplot(data = data, mapping = mapping) +
    geom_point(color = 'blue', alpha=0.3, size=1) +
    geom_smooth(color = 'black', method='lm', size=1,...)
  p
}

g <- ggpairs( 
  data = human,
  lower = list(
    continuous =  wrap(lowerFn) #wrap("smooth", alpha = 0.3, color = "blue", lwd=1) 
  ),
  upper = list(continuous = wrap("cor", size = 4))
)
g <- g + theme(
  axis.text = element_text(size = 4),
  axis.title = element_text(size = 4),
  legend.background = element_rect(fill = "white"),
  panel.grid.major = element_line(colour = NA),
  panel.grid.minor = element_blank(),
  panel.background = element_rect(fill = "grey95")
)
print(g, bottomHeightProportion = 0.5, leftWidthProportion = .5)


Data interpretation

Correlations:

For the sake of clarity I chose one attribute which clearly correlates with the other variables and therefore explains the data well: Ado.Birth.

The adolescent birth rate is the annual number of live births to adolescent women per 1,000 adolescent women. It is also referred to as the age-specific fertility rate for women aged 15-19.

  • Edu2.FM: Strong negative correlation (-0.53) - in countries where women have children at a very young age, women's participation in education is low.

  • Edu.Exp: Strong negative correlation (-0.70) - the expected years of schooling are very low where the adolescent birth rate is high, as young people leave education early to start families.

  • Life.Exp: Strong negative correlation (-0.73) - the expected lifespan decreases as the adolescent birth rate increases. Many issues are related: having children without stable income and education is not a good starting point for the offspring either.

  • GNI: Strong negative correlation (-0.56) - national income decreases as Ado.Birth increases: without education, incomes stay low.

  • Mat.Mor: Positive correlation - maternal mortality (deaths related to pregnancy and childbirth) rises together with the adolescent birth rate, which makes sense: both reflect weak health care and early pregnancies.


Principal component analysis (PCA)

1. Unstandardized data

Next we will perform PCA-analysis for the HDI-dataset. We will be solving the components that capture most of the variability in the data.

First we perform PCA with unstandardized data.

human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep  =",", header = T)
human = transform(human, GNI = as.numeric(GNI))
human = transform(human, Mat.Mor = as.numeric(Mat.Mor))
pca_human <- prcomp(human, scale=FALSE)
print("The variability captured by the principal components:")
## [1] "The variability captured by the principal components:"
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01
##                     PC5           PC6           PC7           PC8
## Edu2.FM   -0.0022935252  2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    0.0022190154  3.264423e-02  7.132267e-01 -7.001533e-01
## Edu.Exp    0.1431180282  9.882477e-01 -3.826887e-02  7.776451e-03
## Life.Exp   0.9865644425 -1.453515e-01  5.380452e-03  2.281723e-03
## GNI       -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    0.0266373214  1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  0.0188618600  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -0.0716401914 -2.309896e-02 -2.642548e-03  2.680113e-03
biplot(pca_human)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped


2. Standardized data

Next we will run PCA and plot results with standardized data.

Definition of scale: "scale, with default settings, will calculate the mean and standard deviation of the entire vector, then 'scale' each element by those values by subtracting the mean and dividing by the sd." (https://stackoverflow.com/questions/20256028/understanding-scale-in-r)


human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep  =",", header = T)
human = transform(human, GNI = as.numeric(GNI))
human = transform(human, Mat.Mor = as.numeric(Mat.Mor))
human_std <- scale(human)
pca_human <- prcomp(human_std, scale=T)
print("The variability captured by the principal components:")
## [1] "The variability captured by the principal components:"
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
biplot(pca_human)


PCA - Conclusion

As the figures above show, PCA on unstandardized data can produce unwanted results, mainly because variables with the highest sample variances tend to dominate the first few principal components. Thus, principal component analysis based on the covariance matrix should only be considered if all of the variables have the same units of measurement.

  • The results of a PCA depend on the measurement scales: it is recommended to standardize the data before the analysis. In our case, GNI (country income, values in thousands) dominates the unstandardized analysis.
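The effect can be reproduced on simulated data (a minimal sketch, not the human data): with two independent variables on very different scales, covariance-based PCA loads PC1 almost entirely on the large-scale variable, while correlation-based (standardized) PCA weights both equally.

```r
set.seed(1)
# Two independent variables on very different scales (simulated toy data)
small <- rnorm(100, sd = 1)
large <- rnorm(100, sd = 1000)
X <- cbind(small, large)

# Covariance-based PCA: PC1 is essentially just the large-variance variable
pca_cov <- prcomp(X, scale. = FALSE)
round(abs(pca_cov$rotation["large", "PC1"]), 3)  # ~1

# Correlation-based PCA (standardized): both variables get equal weight
pca_cor <- prcomp(X, scale. = TRUE)
round(abs(pca_cor$rotation[, "PC1"]), 3)         # both ~0.707
```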


PCA visualization continued

library(dplyr); library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",sep  =",", header = T)
human = transform(human, GNI = as.numeric(GNI))
human = transform(human, Mat.Mor = as.numeric(Mat.Mor))
human_std <- scale(human)
pca_human <- prcomp(human_std, scale=T)
fviz_eig(pca_human)

s <- summary(pca_human)
print("Summary of PCA-analysis")
## [1] "Summary of PCA-analysis"
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
pca_pr <- round(1*s$importance[2, ], digits = 2)
print("The variability captured by the principal components:")
## [1] "The variability captured by the principal components:"
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 0.54 0.16 0.10 0.08 0.05 0.04 0.03 0.01
#str(as.data.frame(as.table(human_std)))
fviz_pca_var(pca_human, col.var="contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )

# Contributions of variables to PC1
fviz_contrib(pca_human, choice = "var", axes = 1, top = 10)

# Contributions of variables to PC2
fviz_contrib(pca_human, choice = "var", axes = 2, top = 10)

pca_pr = pca_pr*100
pc_lab = paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "red"), xlab = pc_lab[1], ylab = pc_lab[2])


As the figure above shows, the first principal component (PC1) explains 54 % of the variance in the data. The second component (PC2) explains much less, 16 % in total.
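The proportions printed above are simply the squared standard deviations (eigenvalues) divided by their sum; for standardized data the eigenvalues sum to the number of variables. Using the standard deviations reported by `prcomp` above:

```r
# Standard deviations of the principal components, copied from the output above
sdev <- c(2.0708380, 1.1397204, 0.8750485, 0.7788630,
          0.6619563, 0.5363061, 0.4589994, 0.3222406)
round(sdev^2 / sum(sdev^2), 2)  # 0.54 0.16 0.10 0.08 0.05 0.04 0.03 0.01
sum(sdev^2)                     # 8, the number of variables
```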

Personal interpretation of 2 principal component dimensions

As we can see from the biplot above, it seems that the data can be divided into three variable groups:

  1. Edu.Exp, GNI, Edu2.FM, Life.Exp

  2. Parli.F, Labo.FM

  3. Mat.Mor, Ado.Birth

If we had to choose two dimensions (variables) to describe the data, I would choose Life.Exp and Mat.Mor, since they have the largest contributions to the first principal component (bar chart above). Alternatively, the second dimension could be taken from the second component, e.g. Labo.FM, which contributes over 50 %.


Exercise with Tea-dataset

Data summary

library(FactoMineR);library(dplyr);library(MASS)
data(tea)
keep <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, keep)
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 


The tea data has 300 observations and originally 36 variables; we chose 6 variables for further examination. Every attribute is categorical: two are binary, the others have 3-4 classes.


Data visualization

library(FactoMineR);library(MASS);library(dplyr);library(tidyr);library(ggplot2)
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, keep_columns)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped


Multiple correspondence analysis

library(FactoMineR);library(MASS);library(dplyr);library(tidyr);library(ggplot2);library("factoextra")
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, keep_columns)
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
#
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))

fviz_contrib(mca, choice = "var", axes = 1, top = 10)

fviz_contrib(mca, choice = "var", axes = 2, top = 10)

plot(mca, invisible=c("ind"))

plot(mca, invisible=c("ind"),habillage = "quali")

fviz_mca_var(mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # Avoid text overlapping
             ggtheme = theme_minimal())

fviz_mca_ind(mca, 
             label = "none", # hide individual labels
             habillage = "how", # color by groups 
             palette = c("#00AFBB", "#E7B800","#d9b3ff"),
             addEllipses = TRUE, ellipse.type = "confidence",
             ggtheme = theme_minimal())


MCA-analysis conclusion


The conclusion of the MCA on the tea data is that the data is described rather poorly by any single dimension, as the scree plot (first bar plot) shows: the dimensions are fairly even, each explaining from about 15 % down to 3 % of the variance. In the first dimensions, most of the variance is captured by the variables “where” and “how”; both contribute over 20 %.

In the variable categories figure, we can see that the how- and where-related categories are well represented by our two MCA dimensions. The cos2 value measures how well a category is represented by the components produced by the MCA.

In the last figure, I plotted the observations in the MCA dimensions and grouped them by the “how” variable: it seems the data can be separated fairly well by those three categories. The same phenomenon can be seen with the “where” variable.



Analysis of longitudinal data

Part 1: Analysis of RATS data


RATS data overview

The RATS data consists of rat weight measurements over a 9-week period. The rats were divided into three groups, each on its own nutrient diet. The aim of the study was to examine the differences between the diets through the weight development of the rats.

The data has a total of 16 rats, divided unevenly among the diet groups: group 1 has 8 rats, while groups 2 and 3 have only 4 each.

Weight was measured at 11 time points (longitudinal data). More detailed information is shown below.


library(dplyr)
library(tidyr)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
names(RATS)
##  [1] "ID"    "Group" "WD1"   "WD8"   "WD15"  "WD22"  "WD29"  "WD36" 
##  [9] "WD43"  "WD44"  "WD50"  "WD57"  "WD64"
summary(RATS)
##        ID     Group      WD1             WD8             WD15      
##  1      : 1   1:8   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  2      : 1   2:4   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  3      : 1   3:4   Median :340.0   Median :345.0   Median :347.5  
##  4      : 1         Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  5      : 1         3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  6      : 1         Max.   :555.0   Max.   :560.0   Max.   :565.0  
##  (Other):10                                                        
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##                                                                 
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0  
## 
glimpse(RATS)
## Observations: 16
## Variables: 13
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1   <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 5...
## $ WD8   <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 5...
## $ WD15  <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 5...
## $ WD22  <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 5...
## $ WD29  <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 5...
## $ WD36  <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 5...
## $ WD43  <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 5...
## $ WD44  <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 5...
## $ WD50  <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 6...
## $ WD57  <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 6...
## $ WD64  <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 6...


Plot of individual rat growth profiles


library(dplyr)
library(tidyr)
library(ggplot2)

RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

RATSL <- RATS %>%
  gather(key = WD, value = Weight, -ID, -Group) %>%
  mutate(Time = as.integer(substr(WD,3,4))) 
ggplot(RATSL, aes(x = Time, y = Weight, group = ID,color=Group)) +
  geom_line() +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  theme(legend.position = "top")


Note From the individual rat growth profiles, we can see that the rats in diet group 1 are generally lighter than those in the other two groups. It is also easy to see that the rats in group 1 gain much less weight than the rats in the other groups.


Scatterplot matrix of repeated measures in rat growth data


RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

RATSL <- RATS %>%
  gather(key = WD, value = Weight, -ID, -Group) %>%
  mutate(Time = as.integer(substr(WD,3,4))) 
pairs(RATS[,3:11], 
   main="Scatterplot matrix of repeated measures in rat growth data.")


Note From the scatterplot matrix we can see that there are dependencies between the measurements made at different points in time. This is of course expected, since a rat that is gaining weight will most likely keep gaining weight in the upcoming measurements as well.

Longitudinal data, where a response variable is measured on each subject on several different occasions, poses problems for their analysis because the repeated measurements on each subject are very likely to be correlated rather than independent. (MABS4IODS, part VI)
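The correlation structure described in the quote can be illustrated with simulated data: when subjects differ in their overall level (a random intercept), two measurements taken on the same subjects are automatically correlated, even if the measurement errors are independent. A toy sketch (not the RATS data):

```r
set.seed(42)
n_subj <- 50
# Between-subject variation (random intercept) dominates the measurement noise
subj_level <- rnorm(n_subj, mean = 300, sd = 80)
t1 <- subj_level + rnorm(n_subj, sd = 10)  # measurement occasion 1
t2 <- subj_level + rnorm(n_subj, sd = 10)  # measurement occasion 2
cor(t1, t2)  # high positive correlation between the repeated measures
```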


Fitted growth rate profiles from the interaction model and observed growth rate profiles


Next we will fit a random intercept and random slope model with an interaction (Group × Time interaction for the rat growth data) and compare it to the model without the interaction (ANOVA table).


library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
library(cowplot)
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

RATSL <- RATS %>%
  gather(key = WD, value = Weight, -ID, -Group) %>%
  mutate(Time = as.integer(substr(WD,3,4))) 

g1 <- ggplot(RATSL, aes(x = Time, y = Weight, group = ID, color=Group)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "Observed weight (grams)") +
  theme(legend.position = "top")

RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)

RATS_ref2 <- lmer(Weight ~ Time * Group + (Time | ID), data = RATSL, REML = FALSE)
anova(RATS_ref2, RATS_ref1)
## Data: RATSL
## Models:
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## RATS_ref2: Weight ~ Time * Group + (Time | ID)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)   
## RATS_ref1  8 1194.2 1219.6 -589.11   1178.2                            
## RATS_ref2 10 1185.9 1217.6 -582.93   1165.9 12.361      2    0.00207 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fitted <- fitted(RATS_ref2)

RATSL <- RATSL  %>%
  mutate(fitted = as.numeric(Fitted)) 

g2 <- ggplot(RATSL, aes(x = Time, y = fitted, group = ID, color=Group)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "Fitted weight (grams)") +
  theme(legend.position = "top")

plot_grid(g1, g2, labels = "AUTO")


Note From the ANOVA table, the likelihood ratio test statistic for the interaction is 12.36 (Chisq) with 2 degrees of freedom; the associated p-value (0.00207) is very small, so we can conclude that the interaction model provides a better fit for the rat growth data (MABS4IODS).
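The chi-squared statistic in the ANOVA table is the difference between the two models' deviances, and its p-value can be reproduced from the chi-squared distribution (a quick check, using the rounded values printed above):

```r
# Likelihood ratio test "by hand": difference in deviances, df = difference
# in the number of model parameters (10 - 8 = 2)
chisq <- 12.361
p_val <- pchisq(chisq, df = 2, lower.tail = FALSE)
round(p_val, 5)  # 0.00207
```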


Part 2: Analysis of BPRS data


BPRS data overview

Like the earlier RATS data, the BPRS (Brief Psychiatric Rating Scale) data has measurements made over time. The data consists of 40 subjects who were interviewed by doctors once a week for nine weeks (weeks 0-8). The measurements are given as BPRS scores.

Below is the summary of the dataframe.


BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)

names(BPRS) 
##  [1] "treatment" "subject"   "week0"     "week1"     "week2"    
##  [6] "week3"     "week4"     "week5"     "week6"     "week7"    
## [11] "week8"
dim(BPRS)
## [1] 40 11
summary(BPRS)
##    treatment      subject          week0           week1      
##  Min.   :1.0   Min.   : 1.00   Min.   :24.00   Min.   :23.00  
##  1st Qu.:1.0   1st Qu.: 5.75   1st Qu.:38.00   1st Qu.:35.00  
##  Median :1.5   Median :10.50   Median :46.00   Median :41.00  
##  Mean   :1.5   Mean   :10.50   Mean   :48.00   Mean   :46.33  
##  3rd Qu.:2.0   3rd Qu.:15.25   3rd Qu.:58.25   3rd Qu.:54.25  
##  Max.   :2.0   Max.   :20.00   Max.   :78.00   Max.   :95.00  
##      week2          week3           week4           week5      
##  Min.   :26.0   Min.   :24.00   Min.   :20.00   Min.   :20.00  
##  1st Qu.:32.0   1st Qu.:29.75   1st Qu.:28.00   1st Qu.:26.00  
##  Median :38.0   Median :36.50   Median :34.50   Median :30.50  
##  Mean   :41.7   Mean   :39.15   Mean   :36.35   Mean   :32.55  
##  3rd Qu.:49.0   3rd Qu.:44.50   3rd Qu.:43.00   3rd Qu.:38.00  
##  Max.   :75.0   Max.   :76.00   Max.   :66.00   Max.   :64.00  
##      week6           week7          week8      
##  Min.   :19.00   Min.   :18.0   Min.   :20.00  
##  1st Qu.:22.75   1st Qu.:23.0   1st Qu.:22.75  
##  Median :28.50   Median :30.0   Median :28.00  
##  Mean   :31.23   Mean   :32.2   Mean   :31.43  
##  3rd Qu.:37.00   3rd Qu.:38.0   3rd Qu.:35.25  
##  Max.   :64.00   Max.   :62.0   Max.   :75.00

Individual response profiles by treatment group for the BPRS data


library(dplyr)
library(tidyr)
library(ggplot2)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))


Individual response profiles for BPRS data after standardization.


library(dplyr)
library(tidyr)
library(ggplot2)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))

BPRSL <- BPRSL %>%
  group_by(week) %>%
  mutate(stdbprs = as.numeric(scale(bprs))) %>%
  ungroup()
ggplot(BPRSL, aes(x = week, y = stdbprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(name = "standardized bprs")


Note The tracking phenomenon can be seen from the standardized values of each observation. There are substantial individual differences, and the variability appears to decrease with time. (MABS4IODS)
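The `group_by(week)` + `scale()` step above standardizes within each time point, so every week gets mean 0 and standard deviation 1. A base-R toy check (made-up numbers, not the BPRS data):

```r
week <- rep(c(0, 1), each = 3)
bprs <- c(40, 50, 60, 30, 35, 40)  # hypothetical scores for two weeks
# Standardize within each week, keeping the original row order
std <- ave(bprs, week, FUN = function(x) as.numeric(scale(x)))
tapply(std, week, mean)  # both 0
tapply(std, week, sd)    # both 1
```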

Mean response profiles for the two treatment groups in the BPRS data


A better way to graphically represent the differences between the treatment groups is to plot the average profile of each group.


BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
BPRSL <- BPRSL %>%
  group_by(week) %>%
  mutate(stdbprs = as.numeric(scale(bprs))) %>%
  ungroup()
n <- BPRSL$week %>% unique() %>% length()

BPRSS <- BPRSL %>%
  group_by(treatment, week) %>%
  summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
  ungroup()
ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(bprs) +/- se(bprs)")


From the mean response profiles of the two treatment groups in the BPRS data, there seems to be only slight overlap between the profiles, which would indicate differences between the treatment groups.

Boxplots for the BPRS data


BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
BPRSL <- BPRSL %>%
  group_by(week) %>%
  mutate(stdbprs = as.numeric(scale(bprs))) %>%
  ungroup()
n <- BPRSL$week %>% unique() %>% length()

BPRSS <- BPRSL %>%
  group_by(treatment, week) %>%
  summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
  ungroup()
ggplot(data = BPRSL, aes(x=weeks, y=bprs)) + geom_boxplot(aes(fill=treatment))


Note The mean profiles can also be shown as boxplots with the treatment groups side by side. The plot suggests that there might be differences between the treatment groups, and it reveals outliers (points). Also, there is a clear decline in the BPRS values over time, so the treatments seem to be working for the patients.


Boxplot of the mean versus treatment


library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
library(cowplot)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
BPRSL <- BPRSL %>%
  group_by(week) %>%
  mutate(stdbprs = as.numeric(scale(bprs))) %>%
  ungroup()
n <- BPRSL$week %>% unique() %>% length()
BPRSS <- BPRSL %>%
  group_by(treatment, week) %>%
  summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
  ungroup()

BPRSL8S <- BPRSL %>%
  filter(week > 0) %>%
  group_by(treatment, subject) %>%
  summarise( mean=mean(bprs) ) %>%
  ungroup()
g1 <- ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
  geom_boxplot() +
  ggtitle("Original")+
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(bprs), weeks 1-8")
BPRSL8S1 <- BPRSL8S  %>%
  filter(subject != 11)

g2 <- ggplot(BPRSL8S1, aes(x = treatment, y = mean)) +
  geom_boxplot() +
  ggtitle("Outlier filtered")+
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(bprs), weeks 1-8")
plot_grid(g1, g2, labels = "AUTO")


T-test and Anova (with baseline)


BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,6)))
BPRSL <- BPRSL %>%
  group_by(week) %>%
  mutate(stdbprs = as.numeric(scale(bprs))) %>%
  ungroup()
n <- BPRSL$week %>% unique() %>% length()
BPRSS <- BPRSL %>%
  group_by(treatment, week) %>%
  summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
  ungroup()

BPRSL8S <- BPRSL %>%
  filter(week > 0) %>%
  group_by(treatment, subject) %>%
  summarise( mean=mean(bprs) ) %>%
  ungroup()

t.test(mean ~ treatment, data = BPRSL8S1, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by treatment
## t = 0.50364, df = 36, p-value = 0.6176
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.400946  7.308841
## sample estimates:
## mean in group 1 mean in group 2 
##        36.15789        34.70395
# Add the baseline from the original data as a new variable to the summary data
BPRSL8S2 <- BPRSL8S %>%
  mutate(baseline = BPRS$week0)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + treatment, data = BPRSL8S2)

# Compute the analysis of variance table for the fitted model with anova()
anova = anova(fit)
anova
## Analysis of Variance Table
## 
## Response: mean
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## baseline   1 1868.07 1868.07 30.1437 3.077e-06 ***
## treatment  1    3.45    3.45  0.0557    0.8148    
## Residuals 37 2292.97   61.97                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The t-test shatters my earlier conclusion (boxplots) that there would be statistically significant differences between the treatment groups: p-value = 0.6176.

From the ANOVA table we can see that each subject's BPRS values are strongly related to the baseline (week 0) BPRS values. However, there is no statistically significant difference between the treatment groups: the p-value is 0.8148.
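The baseline adjustment above is a standard ANCOVA. A self-contained sketch with simulated data (not the BPRS data) illustrates the pattern: when the outcome tracks the baseline and there is no true treatment effect, the baseline term is highly significant while the treatment term typically is not.

```r
set.seed(7)
n <- 40
baseline  <- rnorm(n, mean = 48, sd = 12)
treatment <- factor(rep(1:2, each = n / 2))
outcome   <- 10 + 0.5 * baseline + rnorm(n, sd = 5)  # no true treatment effect

# ANCOVA: adjust the group comparison for the baseline measurement
fit <- lm(outcome ~ baseline + treatment)
anova(fit)
```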